
Allow Bench To Configure Data Processing Pipeline Per Scenario #60

Merged
merged 5 commits into main from bench-tokenize on Aug 2, 2024

Conversation

fabianlim
Contributor

@fabianlim fabianlim commented Jul 31, 2024

This PR allows bench to configure a `data_processing` stanza per scenario. Currently we support two styles:

  1. a functional recipe style, where the formatting functions must be implemented in Python.
  2. a Jinja style, where the formatting is given as a template.

Loss Masking

  • Loss masking is performed automatically if `--response_template` is passed in the arguments.
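The PR does not show the masking code itself, but the idea behind response-template loss masking can be sketched as follows. This is a hypothetical illustration, not the repo's actual implementation: every label token up to and including the response template is replaced with the ignore index, so the loss is computed only on the response tokens.

```python
# Hedged sketch: mask all tokens before (and including) the response
# template so only the response contributes to the loss.
IGNORE_INDEX = -100  # the index ignored by cross-entropy loss in PyTorch

def mask_before_response(input_ids, response_template_ids):
    """Return labels where everything up to and including the first
    occurrence of the response template is masked with IGNORE_INDEX."""
    labels = list(input_ids)
    n = len(response_template_ids)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == response_template_ids:
            # mask the prompt tokens and the template tokens themselves
            for i in range(start + n):
                labels[i] = IGNORE_INDEX
            break
    return labels
```

In practice this is typically done on tokenized ids (e.g. the token ids of `"### Response:"`); the function names here are illustrative only.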

Older Style

In this style, the recipe is specified by the `formatting` flag. This requires updating Python code so it understands what needs to be done for that particular recipe.

  • For the recipe to understand the fields of the dataset, you also need to specify `input_field` and so on.
  • This style is not ideal, as everything is opaque: you have to open the data-processing functions to see exactly what processing is actually happening.
```yaml
data_processing:
  dataset_name: yahma/alpaca-cleaned
  formatting: "instruct"
  tokenize: True
  input_field: input
```
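To make the "opaque" complaint concrete, a functional-recipe dispatcher of this kind might look like the sketch below. All names (`format_instruct`, `FORMATTING_RECIPES`, `apply_recipe`) are hypothetical, invented for illustration; only the `formatting` and `input_field` config keys come from the PR.

```python
# Hedged sketch of the "functional recipe" style: the `formatting` config
# key selects a Python function, and `input_field` names the dataset
# column it should read. None of these names are the repo's actual code.
def format_instruct(example, input_field="input"):
    prompt = f"### Instruction:\n{example['instruction']}\n"
    if example.get(input_field):
        prompt += f"### Input:\n{example[input_field]}\n"
    return prompt + f"### Response:\n{example['output']}"

FORMATTING_RECIPES = {"instruct": format_instruct}

def apply_recipe(example, formatting, input_field="input"):
    # the processing behind each recipe name lives in Python code,
    # which is why this style is opaque from the config alone
    return FORMATTING_RECIPES[formatting](example, input_field=input_field)
```

Note how nothing in the YAML reveals what `"instruct"` does; you must read the Python function to find out.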

Chat Templates

In this style we rely on HF's integration of chat templating.

  • This is more flexible and is the preferred approach.
```yaml
data_processing:
  dataset_name: yahma/alpaca-cleaned
  chat_template: |
    {%- for message in messages %}
        {% if message['input'] != '' %}
    Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

        {% else %}
    Below is an instruction that describes a task. Write a response that appropriately completes the request.

        {% endif %}
    ### Instruction:
    {{ message['instruction'] }}

        {% if message['input'] != '' %}
    ### Input:
    {{ message['input'] }}

        {% endif %}
    ### Response:
    {{ message['output'] + eos_token }}
    {% endfor %}
  tokenize: True
```
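HF tokenizers render such a `chat_template` internally (via `tokenizer.apply_chat_template`); the mechanics can be shown with Jinja2 directly. The sketch below uses a simplified template in the same spirit as the one above, assuming the same `messages` / `eos_token` variables; it is an illustration, not the template the config actually ships.

```python
# Hedged sketch: rendering an Alpaca-style chat template with Jinja2.
# The template here is simplified relative to the one in the config.
from jinja2 import Template

CHAT_TEMPLATE = (
    "{%- for message in messages %}"
    "### Instruction:\n{{ message['instruction'] }}\n"
    "{% if message['input'] != '' %}### Input:\n{{ message['input'] }}\n{% endif %}"
    "### Response:\n{{ message['output'] + eos_token }}"
    "{% endfor %}"
)

def render(messages, eos_token="</s>"):
    # the same variables HF exposes to chat templates: messages, eos_token
    return Template(CHAT_TEMPLATE).render(messages=messages, eos_token=eos_token)
```

Because the formatting now lives in the config as a template string, you can see the exact prompt layout without opening any Python code, which is why this style is preferred.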

@fabianlim fabianlim requested a review from achew010 July 31, 2024 13:39
@fabianlim fabianlim marked this pull request as draft July 31, 2024 13:39
@fabianlim fabianlim force-pushed the bench-tokenize branch 3 times, most recently from 7d200de to b3f43b7 Compare August 1, 2024 08:23
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
@fabianlim fabianlim marked this pull request as ready for review August 1, 2024 16:41
@fabianlim fabianlim merged commit 0e51785 into main Aug 2, 2024
6 checks passed
@fabianlim fabianlim deleted the bench-tokenize branch August 2, 2024 02:35